Credit Card Users Churn Prediction

Description

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead the bank to a loss, so the bank wants to analyze its customer data, identify the customers who are likely to leave their credit card services along with the reasons why, and improve upon those areas.

As data scientists at Thera Bank, we need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

We need to identify the best possible model that delivers the required performance.

Objective

Data Dictionary:

Let's start by importing necessary libraries

Load and overview the dataset

Observations:

Check the percentage of missing values in each column
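This check can be sketched as follows, using a small hypothetical frame in place of the bank's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bank's dataset
data = pd.DataFrame({
    "Customer_Age": [45, 49, np.nan, 40],
    "Education_Level": ["Graduate", None, "Uneducated", "Graduate"],
})

# Percentage of missing values per column
missing_pct = data.isnull().sum() / len(data) * 100
print(missing_pct)
```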

Observations:

Let's check the number of unique values in each column

Observations:

Summary of the data

Let's check the count of each unique category in each of the categorical variables.
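A minimal sketch of this step, assuming the categorical columns are stored as strings (the column names here are stand-ins drawn from the data dictionary):

```python
import pandas as pd

# Hypothetical stand-in for the categorical columns in the dataset
data = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "F"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold"],
})

# Count of each unique category in every string-typed column
for col in data.select_dtypes(include="object").columns:
    print(data[col].value_counts())
    print("-" * 30)
```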

EDA

Univariate Analysis

Observations on Customer_Age

Observations:

Observations on Dependent_count

Observations:

Observations on Months_on_book

Observations:

Observations on Total_Relationship_Count

Observations:

Observations on Months_Inactive_12_mon

Observations:

Observations on Contacts_Count_12_mon

Observations:

Observations on Credit_Limit

Observations:

Observations on Total_Revolving_Bal

Observations:

Observations on Avg_Open_To_Buy

Observations:

Observations on Total_Amt_Chng_Q4_Q1

Observations:

Observations on Total_Trans_Amt

Observations:

Observations on Total_Trans_Ct

Observations:

Observations on Total_Ct_Chng_Q4_Q1

Observations:

Observations on Avg_Utilization_Ratio

Observations:

Let's define a function to create barplots for the categorical variables, indicating the percentage of each category for that variable.
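The function described above could be sketched like this (a minimal version using pandas plotting; the exact implementation in the notebook may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def labeled_barplot(data, feature):
    """Bar plot of `feature` with the percentage of each category annotated."""
    counts = data[feature].value_counts()
    pct = counts / counts.sum() * 100
    ax = counts.plot(kind="bar")
    for patch, value in zip(ax.patches, pct):
        ax.annotate(f"{value:.1f}%",
                    (patch.get_x() + patch.get_width() / 2, patch.get_height()),
                    ha="center", va="bottom")
    ax.set_ylabel("Count")
    return ax

# Toy usage on a hypothetical frame
df = pd.DataFrame({"Gender": ["M", "F", "M", "F", "F"]})
ax = labeled_barplot(df, "Gender")
```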

Observations on Category Columns

Observations:

Bivariate Analysis

Observations:

Let's define one more function to plot stacked bar charts
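One way such a function could look (a sketch built on `pd.crosstab`; the notebook's own version may differ in styling):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def stacked_barplot(data, predictor, target):
    """Stacked bar chart of the `target` distribution within each level of `predictor`."""
    tab = pd.crosstab(data[predictor], data[target], normalize="index")
    ax = tab.plot(kind="bar", stacked=True)
    ax.set_ylabel(f"Proportion of {target}")
    return tab, ax

# Toy usage with hypothetical values from this dataset's columns
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer"],
})
tab, ax = stacked_barplot(df, "Gender", "Attrition_Flag")
```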

Attrition_Flag vs Customer_Age

Attrition_Flag vs Gender

Attrition_Flag vs Dependent_count

Attrition_Flag vs Education_Level

Attrition_Flag vs Marital_Status

Attrition_Flag vs Income_Category

Attrition_Flag vs Card_Category

Attrition_Flag vs Months_on_book

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Contacts_Count_12_mon

Observations:

Correlation Heatmap

Observations:

Feature Engineering

Split the dataset into train and test sets
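A minimal sketch of the split, assuming synthetic stand-in data and a stratified split so the churn rate is preserved in both sets (the actual split ratio used in the notebook is not stated here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and binary churn target y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Stratify on y so both sets keep roughly the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
```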

As we saw earlier, our data has missing values. We will impute missing values using the median for continuous variables and the mode for categorical variables. We will use SimpleImputer to do this.

The SimpleImputer provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located.
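The imputation described above can be sketched as follows (column names are illustrative stand-ins from the data dictionary):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Credit_Limit": [1000.0, np.nan, 3000.0, 5000.0],
    "Education_Level": ["Graduate", np.nan, "Graduate", "Uneducated"],
})

# Median for the continuous column, most frequent (mode) for the categorical one
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df[["Credit_Limit"]] = num_imputer.fit_transform(df[["Credit_Limit"]])
df[["Education_Level"]] = cat_imputer.fit_transform(df[["Education_Level"]])
```

In practice the imputers should be fit on the training set only and then applied to the validation and test sets, to avoid leaking information across the split.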

Let's create dummy variables for the string-type variables and convert the other column types back to float.
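A sketch of this step with `pd.get_dummies` (dropping the first level of each category to avoid redundant columns):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_Age": [45, 49, 40],
    "Gender": ["M", "F", "M"],
})

# One-hot encode string columns, then cast everything to float
X = pd.get_dummies(df, columns=["Gender"], drop_first=True).astype(float)
```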

Decision Tree Classifier

Cost Complexity Pruning

Let's try pruning the tree and see if the performance improves.
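Cost-complexity pruning in scikit-learn works by computing the effective alphas along the pruning path, fitting one tree per alpha, and keeping the tree that performs best on held-out data. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank's training data
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Effective alphas along the minimal cost-complexity pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)

# Fit one tree per alpha and keep the one with the best validation accuracy
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in path.ccp_alphas
]
best = max(trees, key=lambda t: t.score(X_val, y_val))
```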

Hyperparameter Tuning

Random Forest Classifier

Hyperparameter Tuning

Bagging Classifier

Hyperparameter Tuning

AdaBoost Classifier

Hyperparameter Tuning

Gradient Boosting Classifier

Stacking Classifier

Calculating different metrics

stacking_classifier_model_train_perf = model_performance_classification_sklearn(stacking_classifier, X_train, y_train)
print("Training performance:\n", stacking_classifier_model_train_perf)
stacking_classifier_model_val_perf = model_performance_classification_sklearn(stacking_classifier, X_val, y_val)
print("Validation performance:\n", stacking_classifier_model_val_perf)

Creating confusion matrix

confusion_matrix_sklearn(stacking_classifier,X_val,y_val)

Comparing all models

Oversampling train data using SMOTE

Logistic Regression on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score
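A sketch of this evaluation on synthetic stand-in data; recall is used as the scoring metric here as an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the oversampled training data
X, y = make_classification(n_samples=200, random_state=1)

# 5-fold cross-validated recall for a logistic regression
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="recall"
)
```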

Decision Tree Classifier on oversampled data

Hyperparameter Tuning

Random Forest Classifier on oversampled data

Bagging Classifier on oversampled data

AdaBoost Classifier on oversampled data

Gradient Boosting Classifier on oversampled data

Regularization

Comparing all models on oversampled data

Undersampling train data using Random Under Sampler

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

Decision Tree Classifier on undersampled data

Cost Complexity Pruning on undersampled data

Let's try pruning the tree and see if the performance improves.

Random Forest Classifier on undersampled data

Bagging Classifier on undersampled data

AdaBoost Classifier on undersampled data

Gradient Boosting Classifier on undersampled data

Hyperparameter Tuning

Comparing all models on undersampled data

Comparison of the best tuned models

The three models below performed comparatively well on all the metrics, so they were tuned using hyperparameter tuning.

Productionize the model: make_pipeline for the tuned AdaBoost model on the test data
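The production pipeline can be sketched as below. The hyperparameter values shown are placeholders, not the actual tuned values from the notebook, and synthetic data stands in for the bank's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data
X, y = make_classification(n_samples=300, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# make_pipeline bundles preprocessing with the model, so the exact same
# steps fitted on the training data are applied to the test data
model = make_pipeline(
    SimpleImputer(strategy="median"),
    AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=1),
)
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
```

Bundling preprocessing into the pipeline prevents train/test leakage and gives a single object that can be pickled and deployed.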

Business Recommendations